Abstract

This paper describes a hierarchical system that predicts one label at a time for automated student response analysis. For the task, we build a binary classification tree that delays more easily confused labels to later stages using hierarchical processes. In particular, the paper describes how the hierarchical classifier has been built and how the classification task has been broken down into binary subtasks. Finally, it discusses the motivations and fundamentals of such an approach.
1 Introduction


One of the aims of educational natural language processing is to provide useful feedback to students in Learning Management Systems. To this end, Dzikovska et al. (2013) proposed the Semeval-2013 Student Response Analysis (SRA) task to automatically grade answers to open questions. The SRA task introduced different levels of granularity, and the results of the participating systems showed that the higher the number of categories, the worse the overall results were. For example, regarding the best results, the macro-average F1 value dropped from 0.72 in the 2-way task (2 categories) to 0.42 in the 5-way task (5 categories).

Our hypothesis states that the set of categories defined for the task in Dzikovska et al. (2012) presents complex and gradual relationships. As a consequence, these relations may be indicative of potential weaknesses when scoring the answers automatically. Given a set of labels (correct, partially_correct, irrelevant, non_domain, contradictory), intuition dictates that we can expect greater problems differentiating correct and partially_correct answers than correct and non_domain* answers. Within this context, the confusion matrix of a tuned 5-way classifier (see Figure 1) gives us an appropriate way to automatically identify the problematic label relationships in order to set fruitful steps towards a better solution. Indeed, the confusion matrix shows a higher error rate when discriminating between correct and partially_correct answers than between correct and non_domain answers.

Figure 1: Confusion matrix of a 5-way supervised classifier on the Semeval-2013 training data.


Previous work has shown that binarizing multiclass problems can be beneficial for improving the overall performance of some systems (Allwein et al., 2001; Lei and Govindaraju, 2005; Marszalek and Schmid, 2008).


In this work, we present a method which walks through distinct nodes by making binary decisions at each step, thus decomposing the main task into several subtasks. SRA systems participating in Semeval-2013 faced the problem as a 5-way classification task. They assumed no hierarchy among the labels and ignored the difficulties that arise from the structured nature of the category set.

*non_domain answers do not include the question's domain content.

Figure 2: Agglomerative hierarchical clustering among the Semeval-2013 data.

In contrast, we construct label partitions and sort labels according to their confusion-relatedness (see Figure 2). To the best of our knowledge, this is the first SRA system that tries to take advantage of the analysis of label interdependencies. In sum, our work tries to induce structured information so that 1) we can prioritize over the label relationships and 2) we can decompose the main task into simpler binary subtasks.


2 Problem formulation and data description

The Semeval-2013 SRA task addresses the problem of grading student responses from different science domains. The corpus for this task was created from two established sources: the BEETLE corpus, a data set collected and annotated during the evaluation of the BEETLE II tutorial dialogue system (Dzikovska et al., 2010); and the SCIENTSBANK corpus, a set of student answers to questions from sixteen science modules in the Assessing Science Knowledge Assessment Inventory (Nielsen et al., 2008). The objective of the task is to determine the label of the answer given an open question and a reference answer. Student answers can be classified at different levels of granularity: 5-way task, 3-way task and 2-way task. In the 5-way task, the one we follow in this work, the aim is to classify the answers as correct, partially correct or incomplete, contradictory, irrelevant, or not in the domain. The whole task is fully documented in Dzikovska et al. (2013).


To build, train and test our hierarchical system we use the same training and test sets as the EHUALM system. EHUALM is an ensemble of distinct supervised classifiers that took part in the Semeval-2013 5-way task. It ranked well above the mean and median in different scenarios of the task. The full system is described in Aldabe et al. (2013). The dataset, which is included in the supporting material, comprises a total of 30 syntactic and semantic similarity features computed between the question and the reference answer(s), and also between the question and the student answer. These features can be grouped into the following sets: text overlap features, WordNet-based features, graph-based features, corpus-based features and predicate-argument features.


3 Hierarchy based system 

In Section 1 we stated that the inner relation between labels is meaningful for modeling similarities in SRA. Such a relation can be used to build a hierarchical approach that takes the label relationships into account inside the classification model. We consider two labels to be similar if the output of a tuned 5-way classifier confuses them frequently. In other words, we expect greater difficulties among labels that are semantically close.

In order to obtain relatedness values among the labels of the Semeval task we need to construct a distance matrix. The distance matrix is then used to construct a dendrogram that connects distinct labels according to their relatedness (see Figure 2). As a first step we calculate the similarity between two labels as follows: given two labels (i and j), we count how many times one label (i) is assigned when the other (j) is the correct one, and vice versa. The average of the two normalized counts defines the similarity between both labels. Similarity is transformed into distance as dist(i, j) = 1 - sim(i, j). Note that the matrix is square and symmetric, and each element at position (i, j) determines the distance between labels i and j. Once the distance matrix has been obtained, we apply agglomerative hierarchical clustering to obtain the partitioning. The code and data to produce the hierarchical structure are supplied in the supporting material.
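
The procedure above can be illustrated with a minimal sketch. It is not the code from the supporting material: the confusion counts are invented, the normalization (dividing each count by the number of instances of the gold label) and the average-linkage criterion are assumptions, since neither is specified in the text, and the names label_distances and agglomerate are hypothetical.

```python
LABELS = ["correct", "partially_correct", "contradictory", "irrelevant", "non_domain"]

# Invented counts for illustration only: CONFUSION[t][p] is the number of
# instances whose gold label is LABELS[t] and whose predicted label is LABELS[p].
CONFUSION = [
    [70, 20, 6, 4, 0],
    [25, 60, 8, 7, 0],
    [15, 12, 65, 8, 0],
    [8, 10, 6, 75, 1],
    [0, 0, 0, 2, 98],
]


def label_distances(confusion):
    """Turn a confusion matrix into a symmetric label-distance matrix."""
    n = len(confusion)
    totals = [sum(row) for row in confusion]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed normalization: divide each count by the number of
            # instances of the gold label involved.
            i_for_j = confusion[j][i] / totals[j] if totals[j] else 0.0
            j_for_i = confusion[i][j] / totals[i] if totals[i] else 0.0
            sim = (i_for_j + j_for_i) / 2.0
            dist[i][j] = dist[j][i] = 1.0 - sim
    return dist


def agglomerate(dist, names):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(names))}  # cluster id -> label indices
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
                d = sum(dist[x][y] for x, y in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        members = clusters.pop(a) + clusters.pop(b)
        merges.append((sorted(names[k] for k in members), d))
        clusters[next_id] = members
        next_id += 1
    return merges


DIST = label_distances(CONFUSION)
MERGES = agglomerate(DIST, LABELS)
for members, height in MERGES:
    print(f"{height:.3f}  {members}")
```

On these invented counts, correct and partially_correct are merged first and non_domain joins last. This mirrors the ordering discussed in the paper, but only because the toy matrix was chosen to do so.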

Once the hierarchical structure is defined, a hierarchy of binary Support Vector Machines (Chang and Lin, 2011) is built (a 2-way SVM tree) by recursively dividing the training dataset of K classes into two subsets of classes. Thus, each node decides whether a sample belongs to a specific label or to one of the labels in the sub-hierarchy.

For the experiments we define two different hierarchical configurations: H1 and H2. H1 starts by discarding the most dissimilar labels (i.e. non_domain vs rest) and finishes with the most similar ones (i.e. correct vs partially_correct). In contrast, H2 starts from the most similar labels and moves towards the most dissimilar ones.

Each node of the tree is trained independently by mapping the whole training set to the classes it needs to handle. The leaf label is the positive class (e.g. non_domain at the top of Figure 2) and the remaining labels under the sub-hierarchy make up the negative class. This approach is similar to the all-pairs approach mentioned in Allwein et al. (2001), as some instances are ignored when training the non-top binary classifiers. That is, when training the non-top binary classifiers we do not consider the instances of the classes handled at higher levels of the hierarchy. For testing we apply the whole hierarchy to each incoming instance. We evaluate the architecture in two ways: 1) measuring the local performance of each level, and 2) measuring the overall performance of the tree.
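
The node-wise training and the top-down application of the tree can be sketched as follows. This is an illustration under stated assumptions rather than the system itself: a max-margin threshold on a single invented feature stands in for the binary SVMs, and all names (H1_ORDER, node_training_sets, train_threshold_classifier, train_tree, predict) are hypothetical.

```python
# Order of the leaf labels in the H1 configuration, top to bottom (see Table 1).
H1_ORDER = ["non_domain", "irrelevant", "contradictory", "partially_correct"]
H1_LAST = "correct"  # label returned when every node answers "rest"

# Invented 1-D training data: (feature_vector, gold_label) pairs.
SAMPLES = [
    ([0.0], "non_domain"), ([0.2], "non_domain"),
    ([1.0], "irrelevant"), ([1.2], "irrelevant"),
    ([2.0], "contradictory"), ([2.2], "contradictory"),
    ([3.0], "partially_correct"), ([3.2], "partially_correct"),
    ([4.0], "correct"), ([4.2], "correct"),
]


def node_training_sets(samples, order):
    """Yield (leaf_label, features, targets) for each node, top to bottom."""
    remaining = list(samples)
    for leaf in order:
        features = [x for x, _ in remaining]
        targets = [1 if y == leaf else 0 for _, y in remaining]
        yield leaf, features, targets
        # Instances of a class handled at this level are ignored further down.
        remaining = [(x, y) for x, y in remaining if y != leaf]


def train_threshold_classifier(features, targets):
    """Toy stand-in for a binary SVM: max-margin cut on separable 1-D data."""
    pos = [x[0] for x, t in zip(features, targets) if t == 1]
    neg = [x[0] for x, t in zip(features, targets) if t == 0]
    if max(pos) < min(neg):
        cut = (max(pos) + min(neg)) / 2.0
        return lambda x: x[0] < cut
    if min(pos) > max(neg):
        cut = (min(pos) + max(neg)) / 2.0
        return lambda x: x[0] > cut
    raise ValueError("toy classifier needs linearly separable 1-D data")


def train_tree(samples, order, train_binary):
    """Train one binary classifier per node of the chain-shaped tree."""
    return [
        (leaf, train_binary(features, targets))
        for leaf, features, targets in node_training_sets(samples, order)
    ]


def predict(tree, x, last_label):
    """Walk down the tree and stop at the first node that claims the instance."""
    for leaf, is_leaf in tree:
        if is_leaf(x):
            return leaf
    return last_label


TREE = train_tree(SAMPLES, H1_ORDER, train_threshold_classifier)
print(predict(TREE, [1.1], H1_LAST))
```

Under the same assumptions, H2 would correspond to the order correct, partially_correct, contradictory, irrelevant, with non_domain as the last label.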


4 Experiments and results 

To train and evaluate the two hierarchical configurations we used the SRA training and test data described in Section 2. Each binary SVM classifier of the hierarchy was tuned using 5-fold cross-validation, with micro accuracy as the objective function to maximize. We calculated level-wise and overall results in order to analyze both the contribution of each binary classifier and the overall performance. Table 1 summarizes the results obtained for both hierarchical configurations.


System      Description               F-Score
H1 overall  Whole tree                0.56
H1 L1       Non domain vs Rest        0.988
H1 L2       Irrelevant vs Rest        0.856
H1 L3       Contradictory vs Rest     0.795
H1 L4       Partially vs Correct      0.745
H2 overall  Whole tree                0.568
H2 L1       Correct vs Rest           0.762
H2 L2       Partially vs Rest         0.699
H2 L3       Contradictory vs Rest     0.78
H2 L4       Irrelevant vs Non domain  0.96

Table 1: Micro F-scores on 5-fold cross-validation for each level ('L') of the two hierarchical configurations ('H').


Concerning the level evaluation, the highest F-score value (0.988) is obtained when dealing with instances of the most distant class (top level of the H1 configuration). Accuracy decreases when descending the hierarchy. Just the opposite effect is shown in the other hierarchy, as it deals first with the most similar classes. In fact, H2 obtains its highest F-score value (0.96) at the lowest level. This result is consistent with the confusion matrix shown in Figure 1, as non_domain is the class with the lowest false positive rate (non_domain predicted row). As regards overall results, accuracy drops considerably in comparison to the values obtained in the level results. In our opinion, this result is related to error propagation in the tree. We define error propagation as the situation in which instances that actually belong to one class are incorrectly classified and continue down the tree to the next level, degrading the performance at subsequent levels.
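
A rough back-of-the-envelope illustration of how errors can compound along the tree is given below. It treats the level-wise scores of H1 in Table 1 as if they were independent per-node success probabilities, which is an assumption made only for this sketch and not a claim of the paper; the function name cumulative_products is hypothetical. Only instances that reach the deepest node pass through all four decisions, so the figures are not meant to reproduce the overall score.

```python
# Level-wise micro F-scores of the H1 configuration (Table 1, L1 to L4).
H1_LEVEL_SCORES = [0.988, 0.856, 0.795, 0.745]


def cumulative_products(scores):
    """Running product of the scores, one value per level."""
    out, running = [], 1.0
    for score in scores:
        running *= score
        out.append(running)
    return out


H1_CUMULATIVE = cumulative_products(H1_LEVEL_SCORES)
for level, value in enumerate(H1_CUMULATIVE, start=1):
    print(f"after L{level}: {value:.3f}")
```

Under this simplification the running product falls to roughly 0.50 after the fourth level, which gives an intuition for why good level-wise scores do not translate into an equally good overall score.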

Comparing the performance of the hierarchical approaches and a 5-way flat SVM model (also tuned on the training set), the results turn out to be similar. The micro F-score results for H1 and H2 are 0.56 and 0.568 respectively (0.571 for the flat SVM), and the macro F-score results for H1 and H2 are 0.553 and 0.566 respectively (0.566 for the flat SVM).
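
For reference, the difference between the two averages can be sketched as follows. The gold and predicted labels are invented and the function name f_scores is hypothetical; the sketch only shows the standard definitions, not the evaluation script used for the task.

```python
def f_scores(gold, predicted):
    """Return (micro_f, macro_f) for single-label predictions."""
    labels = sorted(set(gold) | set(predicted))
    tp = dict.fromkeys(labels, 0)
    fp = dict.fromkeys(labels, 0)
    fn = dict.fromkeys(labels, 0)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    # Macro: average of the per-label F1 values.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: F1 over the pooled counts of all labels.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Invented toy data.
GOLD = ["correct", "correct", "correct", "correct", "irrelevant", "irrelevant"]
PRED = ["correct", "correct", "correct", "irrelevant", "irrelevant", "correct"]
MICRO, MACRO = f_scores(GOLD, PRED)
print(f"micro={MICRO:.3f} macro={MACRO:.3f}")
```

In a single-label setting such as this one, the micro F-score computed over all labels coincides with accuracy, whereas the macro F-score gives every label the same weight regardless of its frequency.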

Finally, we evaluated the H2 system† using the test data provided for the SRA task. The micro F-score obtained for H2 on the test set was 0.426 (0.415 for the flat SVM) and the macro F-score was 0.408 (0.39 for the flat SVM). Contrary to what happened in the cross-validation setting, the hierarchy-based approach outperformed the flat SVM model. This suggests that the hierarchy-based approach helps to prevent overfitting, as the hierarchical structure imposes coherent biases that relate the problem formulation to the label relationships.

†We present the results of H2, which outperformed H1.

5 Discussion 

The present paper describes ongoing work on building hierarchy-based models to automatically grade student answers as an alternative to typical N-way models. We show that elucidating the relationship between classes in the Semeval-2013 task is more difficult than might be expected using basic hierarchical structures. Even though promising results are obtained in level-wise training, the overall accuracy deteriorates when the whole system is combined. Nevertheless, results show that the hierarchical approach outperforms the flat 5-way SVM on the test set.

In order to address this issue, we consider that a deeper label interdependency analysis is necessary to effectively discriminate among classes with high confusion rates. That is why, as a first step, we conducted an error analysis and found it crucial for future work to specialize the feature set used for training. In fact, we think that specific features may contribute considerably to certain levels of the hierarchy. For instance, features that explicitly handle negation are critical when classifying contradictory instances, but useless at other stages.

As regards the error analysis, we took the confusion matrix of the whole tree and analyzed the major groups of incorrectly classified instances. Out of the total error rate of H2 (42% of instances incorrectly classified), the most problematic misclassifications are distributed across the following categories: a) partially_correct instances classified as correct: 18% error rate; b) correct instances classified as partially_correct: 14% error rate; c) contradictory instances classified as correct: 11% error rate; d) contradictory instances classified as partially_correct: 9% error rate; e) irrelevant instances classified as correct: 8% error rate.
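
A breakdown of this kind can be obtained from a confusion matrix along the following lines. The counts are invented, the function name error_breakdown is hypothetical, and reading the percentages as shares of all misclassified instances is our interpretation of the figures above.

```python
LABELS = ["correct", "partially_correct", "contradictory", "irrelevant", "non_domain"]

# Invented counts for illustration only: CONFUSION[t][p] is the number of
# instances with gold label LABELS[t] that were predicted as LABELS[p].
CONFUSION = [
    [70, 20, 6, 4, 0],
    [25, 60, 8, 7, 0],
    [15, 12, 65, 8, 0],
    [8, 10, 6, 75, 1],
    [0, 0, 0, 2, 98],
]


def error_breakdown(confusion, names):
    """List misclassification types as shares of all errors, largest first."""
    cells = [
        (confusion[t][p], names[t], names[p])
        for t in range(len(names))
        for p in range(len(names))
        if t != p and confusion[t][p] > 0
    ]
    total = sum(count for count, _, _ in cells)
    if total == 0:
        return []
    shares = [(count / total, gold, pred) for count, gold, pred in cells]
    return sorted(shares, reverse=True)


BREAKDOWN = error_breakdown(CONFUSION, LABELS)
for share, gold, pred in BREAKDOWN[:5]:
    print(f"{gold} classified as {pred}: {share:.1%} of all errors")

# The five percentages reported in the text (cases a to e).
REPORTED_SHARES = [18, 14, 11, 9, 8]
print(sum(REPORTED_SHARES))
```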

These five cases account for 60% of the total error rate, while each of the other error cases accounts for less than 6%. Our analysis concluded that, as expected, most of the errors require not only general lexical features but also more fine-grained ones in order to be correctly classified. For instance, we think that the approach taken in the interpretable pilot described in Agirre et al. (2015) can be effective for cases a), b) and e), so as to obtain a more fine-grained linkage between the concepts of the student answer and the concepts of the reference answer. The usefulness of this approach resides in making alignments between concepts in the reference-student pair.

In contrast, for contradictory-related misclassifications, such as c) and d), we think that specific features able to handle negation are crucial to improve performance. In fact, we performed a deeper analysis of case c) by randomly sampling 20 instances and found that, even for humans, the annotated gold values are sometimes ambiguous. Out of the 20 student answers analyzed we found 5 ambiguous cases, some of them related to the usage of negative polarity particles. For example, for the question "Why was bulb C off when switch Z was open?" and reference answer "There was no longer a closed path containing Bulb C and the battery", the following student answer is annotated as contradictory: "switch Z created a gap in the closed circuit required for bulb C". However, our hierarchical system scores it as correct, which we think is the most suitable grade. The same happens with the following reference-student pair: "(Reference) The more blocks the truck carries, the less distance the truck travels in 10 seconds." and "(Student) It went farther with less blocks and it went no farther with more blocks."

In all, grading student answers is a challenging task that requires analyzing and identifying errors and misconceptions based on reference responses. As regards student answer clustering, even though some work has been done, such as the work described in Basu et al. (2013) and Zesch et al. (2015), it is still an open research line to prove whether clustering structures can meaningfully improve performance.


6 Future work 

Even though the hierarchical decomposition explained above has shown negative results, we plan to continue analyzing distinct configurations to specialize the concrete feature list for each level of the hierarchy. In fact, we think that the system requires new attributes in order to improve. Distinct attributes may contribute differently at each level; for instance, features that explicitly handle negation seem to be critical in the modules responsible for classifying contradictory instances, but useless in other modules. Moreover, we also plan to explore new similarity features from top-performing systems of the SemEval task that could contribute to our system. In all, we plan to continue analyzing strategies and developing techniques in order to improve the automatic scores provided.